Adil Yıldız
IT Expert - Data Science Enthusiast - Technophile
I am a continuous learner and a reliable, dedicated, hardworking IT expert with 15+ years of experience. I have worked for leading companies in a variety of roles. I am skilled mainly in DevOps, highly available service management, technical project management, data science and product management.
This page is dedicated to my small steps in data science. I have a degree in this field and I am trying to extend my knowledge with real-life applications. I will share here mainly things that are useful, practical and applicable. Please contact me with any feedback or questions, positive or negative.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose please cite them as they have been very generous in allowing their data to be used for this kind of assignment.
The goal of this study is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. Any of the other variables may be used as predictors. I will explain step by step how I built my model, how I used cross validation, what I think the expected out-of-sample error is, and why I made the choices I did. I will also use the selected prediction model to predict 20 different test cases at the end of the study.
After preprocessing the data we will have 3 data sets on hand: training, testing and pml_testing. I will use training to train the models. The testing set will be used for model selection (validation). The pml_testing set will be used for making the predictions with the selected model.
# load the required packages
library(caret)   # createDataPartition, preProcess, train, confusionMatrix
library(plotly)  # exploratory plots
library(earth)   # MARS model and evimp() variable importance
library(ipred)   # bagging() (assumed source; adabag also provides one)
library(C50)     # C5.0()
setwd("C:/Users/Acer-nb/Downloads/ML_Project")
# read csv data with NA and blank values set to NA
pml_training <- read.csv("pml-training.csv",na.strings=c("NA","NaN", " ",""))
pml_testing <- read.csv("pml-testing.csv",na.strings=c("NA","NaN", " ",""))
# remove columns with na values
pml_train <- pml_training[,!colSums(is.na(pml_training))>0]
pml_test <- pml_testing[,!colSums(is.na(pml_testing))>0]
# drop identifier, timestamp and window columns (not useful as predictors)
dropcl <- grep("name|timestamp|window|X", colnames(pml_train), value=F)
pml_training <- pml_train[,-dropcl]
dropcl <- grep("name|timestamp|window|X", colnames(pml_test), value=F)
pml_testing <- pml_test[,-dropcl]
# split the cleaned data: 80% for training, 20% for validation
set.seed(1234)
inTrain <- createDataPartition(y=pml_training$classe,p=0.8, list=FALSE)
training <- pml_training[inTrain,]
testing <- pml_training[-inTrain,]
After removing the unneeded columns, 52 predictors remain in the data. This is quite a lot, so we may not need all of them in our model.
# Make a matrix of correlations of all predictors
M <- abs(cor(training[,-53]))
# Set the diagonal to zero (the correlation of a predictor with itself, it's 1, we know, so we should remove it)
diag(M) <- 0
# Find the parameters having correlation over a threshold.
which(M > 0.8,arr.ind=T)
## row col
## yaw_belt 3 1
## total_accel_belt 4 1
## accel_belt_y 9 1
## accel_belt_z 10 1
## accel_belt_x 8 2
## magnet_belt_x 11 2
## roll_belt 1 3
## roll_belt 1 4
## accel_belt_y 9 4
## accel_belt_z 10 4
## pitch_belt 2 8
## magnet_belt_x 11 8
## roll_belt 1 9
## total_accel_belt 4 9
## accel_belt_z 10 9
## roll_belt 1 10
## total_accel_belt 4 10
## accel_belt_y 9 10
## pitch_belt 2 11
## accel_belt_x 8 11
## gyros_arm_y 19 18
## gyros_arm_x 18 19
## magnet_arm_x 24 21
## accel_arm_x 21 24
## magnet_arm_z 26 25
## magnet_arm_y 25 26
## accel_dumbbell_x 34 28
## accel_dumbbell_z 36 29
## gyros_dumbbell_z 33 31
## gyros_forearm_z 46 31
## gyros_dumbbell_x 31 33
## gyros_forearm_z 46 33
## pitch_dumbbell 28 34
## yaw_dumbbell 29 36
## gyros_forearm_z 46 45
## gyros_dumbbell_x 31 46
## gyros_dumbbell_z 33 46
## gyros_forearm_y 45 46
As the results show, several variables are highly correlated, which means some of them should not both appear in the model. Feature selection should be performed, and the smaller set of selected features should be used to construct the models.
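One simple way to act on these correlations is to greedily drop one variable from the most correlated pair until no pair exceeds the cutoff. The helper below is a hypothetical base-R sketch of that idea (caret's findCorrelation() implements a more careful version of the same procedure), demonstrated on toy data rather than the training set:

```r
# Hypothetical helper: repeatedly drop one variable from the pair with
# the highest absolute correlation until no pair exceeds the cutoff
drop_correlated <- function(df, cutoff = 0.8) {
  M <- abs(cor(df))
  diag(M) <- 0                     # ignore self-correlations
  dropped <- character(0)
  while (max(M) > cutoff) {
    worst <- rownames(M)[which(M == max(M), arr.ind = TRUE)[1, "row"]]
    dropped <- c(dropped, worst)
    M <- M[rownames(M) != worst, colnames(M) != worst, drop = FALSE]
  }
  dropped
}

# Toy example: x2 is nearly a copy of x1, x3 is independent noise
set.seed(1)
d <- data.frame(x1 = rnorm(100))
d$x2 <- d$x1 + rnorm(100, sd = 0.01)
d$x3 <- rnorm(100)
flagged <- drop_correlated(d, cutoff = 0.8)
flagged   # one of x1/x2 is flagged for removal
```

Applying the same helper to `training[,-53]` would flag the belt and dumbbell variables listed in the correlation output above.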
The following exploratory graphs show that some features are quite helpful for separating the classes, so we may expect high accuracy levels when predicting the outcome.
plot_ly(training,color =training$classe,y=training[,1],type="box")
plot_ly(training,color =training$classe,y=training[,3],type="box")
The following models were fit with 25 principal components extracted from the data by PCA. Even after this kind of dimensionality reduction, a random forest model takes too long to run.
# Create as many components as required to explain 95% of the variance
preProc <- preProcess((training[,-53]+1),method=c("center","scale","pca"),thresh = 0.95)
trainPC <- predict(preProc,(training[,-53]+1))
testPC <- predict(preProc,(testing[,-53]+1))
mod1 <- train(x=trainPC, y=training$classe, method="lda")
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
##
## select
mod2 <- train(x=trainPC, y=training$classe, method="knn")
pred1 <- predict(mod1,testPC)
pred2 <- predict(mod2,testPC)
Even after simplifying with PCA, the models take too long to run, so I tried another method to make them simpler.
The earth package implements variable importance based on generalized cross validation (GCV), the number of subset models in which each variable occurs (nsubsets) and the residual sum of squares (RSS). I tried this method on the data as follows:
marsModel <- earth(classe~., data=training)
ev <- evimp (marsModel,trim = FALSE)
ev
## nsubsets gcv rss
## roll_belt 21 100.0 100.0
## magnet_dumbbell_y 20 90.6 90.6
## roll_forearm 19 85.6 85.6
## accel_belt_z 17 72.8 72.9
## magnet_dumbbell_z 16 70.0 70.1
## yaw_belt 15 66.0 66.1
## roll_dumbbell 13 56.2 56.4
## total_accel_dumbbell 12 51.9 52.1
## pitch_belt 10 43.5 43.7
## pitch_forearm 7 31.9 32.1
## total_accel_belt-unused 0 0.0 0.0
## gyros_belt_x-unused 0 0.0 0.0
## gyros_belt_y-unused 0 0.0 0.0
## gyros_belt_z-unused 0 0.0 0.0
## accel_belt_x-unused 0 0.0 0.0
## accel_belt_y-unused 0 0.0 0.0
## magnet_belt_x-unused 0 0.0 0.0
## magnet_belt_y-unused 0 0.0 0.0
## magnet_belt_z-unused 0 0.0 0.0
## roll_arm-unused 0 0.0 0.0
## pitch_arm-unused 0 0.0 0.0
## yaw_arm-unused 0 0.0 0.0
## total_accel_arm-unused 0 0.0 0.0
## gyros_arm_x-unused 0 0.0 0.0
## gyros_arm_y-unused 0 0.0 0.0
## gyros_arm_z-unused 0 0.0 0.0
## accel_arm_x-unused 0 0.0 0.0
## accel_arm_y-unused 0 0.0 0.0
## accel_arm_z-unused 0 0.0 0.0
## magnet_arm_x-unused 0 0.0 0.0
## magnet_arm_y-unused 0 0.0 0.0
## magnet_arm_z-unused 0 0.0 0.0
## pitch_dumbbell-unused 0 0.0 0.0
## yaw_dumbbell-unused 0 0.0 0.0
## gyros_dumbbell_x-unused 0 0.0 0.0
## gyros_dumbbell_y-unused 0 0.0 0.0
## gyros_dumbbell_z-unused 0 0.0 0.0
## accel_dumbbell_x-unused 0 0.0 0.0
## accel_dumbbell_y-unused 0 0.0 0.0
## accel_dumbbell_z-unused 0 0.0 0.0
## magnet_dumbbell_x-unused 0 0.0 0.0
## yaw_forearm-unused 0 0.0 0.0
## total_accel_forearm-unused 0 0.0 0.0
## gyros_forearm_x-unused 0 0.0 0.0
## gyros_forearm_y-unused 0 0.0 0.0
## gyros_forearm_z-unused 0 0.0 0.0
## accel_forearm_x-unused 0 0.0 0.0
## accel_forearm_y-unused 0 0.0 0.0
## accel_forearm_z-unused 0 0.0 0.0
## magnet_forearm_x-unused 0 0.0 0.0
## magnet_forearm_y-unused 0 0.0 0.0
## magnet_forearm_z-unused 0 0.0 0.0
According to the model, only 10 of the variables have an effect on the outcome, so I subset the data to these 10 variables.
training_imp <- subset(training,select = c(classe,roll_belt,magnet_dumbbell_y,roll_forearm,accel_belt_z,magnet_dumbbell_z,yaw_belt,roll_dumbbell,total_accel_dumbbell,pitch_belt,pitch_forearm))
testing_imp <- subset(testing,select = c(classe,roll_belt,magnet_dumbbell_y,roll_forearm,accel_belt_z,magnet_dumbbell_z,yaw_belt,roll_dumbbell,total_accel_dumbbell,pitch_belt,pitch_forearm))
pml_testing_imp <- subset(pml_testing,select = c(roll_belt,magnet_dumbbell_y,roll_forearm,accel_belt_z,magnet_dumbbell_z,yaw_belt,roll_dumbbell,total_accel_dumbbell,pitch_belt,pitch_forearm))
I built 4 models with this subset.
mod11 <- train(classe~.,data=training_imp,method="rf",preProcess=c("center","scale"))
# bagging() and C5.0() are not caret functions and take no preProcess
# argument; tree-based models do not need centering/scaling anyway
mod12 <- bagging(classe~.,data=training_imp)
mod13 <- C5.0(classe~.,data=training_imp)
mod14 <- train(classe~.,data=training_imp,method="knn",preProcess=c("center","scale"))
pred11 <- predict(mod11,testing)
pred12 <- predict(mod12,testing)
pred13 <- predict(mod13,testing)
pred14 <- predict(mod14,testing)
confusionMatrix(predict(mod14,testing_imp),testing_imp$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1096 30 5 2 0
## B 9 671 18 7 10
## C 6 23 629 26 12
## D 3 31 26 605 12
## E 2 4 6 3 687
##
## Overall Statistics
##
## Accuracy : 0.9401
## 95% CI : (0.9322, 0.9473)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9242
## Mcnemar's Test P-Value : 2.215e-05
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9821 0.8841 0.9196 0.9409 0.9528
## Specificity 0.9868 0.9861 0.9793 0.9780 0.9953
## Pos Pred Value 0.9673 0.9385 0.9037 0.8936 0.9786
## Neg Pred Value 0.9928 0.9726 0.9830 0.9883 0.9894
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2794 0.1710 0.1603 0.1542 0.1751
## Detection Prevalence 0.2888 0.1823 0.1774 0.1726 0.1789
## Balanced Accuracy 0.9844 0.9351 0.9495 0.9595 0.9741
confusionMatrix(predict(mod13,testing_imp),testing_imp$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1090 11 8 1 1
## B 21 721 8 3 6
## C 4 15 651 11 6
## D 1 8 14 626 3
## E 0 4 3 2 705
##
## Overall Statistics
##
## Accuracy : 0.9669
## 95% CI : (0.9608, 0.9722)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9581
## Mcnemar's Test P-Value : 0.2972
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9767 0.9499 0.9518 0.9736 0.9778
## Specificity 0.9925 0.9880 0.9889 0.9921 0.9972
## Pos Pred Value 0.9811 0.9499 0.9476 0.9601 0.9874
## Neg Pred Value 0.9908 0.9880 0.9898 0.9948 0.9950
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2778 0.1838 0.1659 0.1596 0.1797
## Detection Prevalence 0.2832 0.1935 0.1751 0.1662 0.1820
## Balanced Accuracy 0.9846 0.9690 0.9703 0.9828 0.9875
confusionMatrix(predict(mod12,testing_imp),testing_imp$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1106 8 2 0 3
## B 3 732 4 2 5
## C 5 10 672 6 3
## D 1 7 6 635 3
## E 1 2 0 0 707
##
## Overall Statistics
##
## Accuracy : 0.9819
## 95% CI : (0.9772, 0.9858)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9771
## Mcnemar's Test P-Value : 0.05179
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9910 0.9644 0.9825 0.9876 0.9806
## Specificity 0.9954 0.9956 0.9926 0.9948 0.9991
## Pos Pred Value 0.9884 0.9812 0.9655 0.9739 0.9958
## Neg Pred Value 0.9964 0.9915 0.9963 0.9976 0.9956
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2819 0.1866 0.1713 0.1619 0.1802
## Detection Prevalence 0.2852 0.1902 0.1774 0.1662 0.1810
## Balanced Accuracy 0.9932 0.9800 0.9875 0.9912 0.9898
confusionMatrix(predict(mod11,testing_imp),testing_imp$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1112 3 1 0 0
## B 2 743 5 0 2
## C 2 7 677 4 0
## D 0 6 1 639 1
## E 0 0 0 0 718
##
## Overall Statistics
##
## Accuracy : 0.9913
## 95% CI : (0.9879, 0.994)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.989
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9964 0.9789 0.9898 0.9938 0.9958
## Specificity 0.9986 0.9972 0.9960 0.9976 1.0000
## Pos Pred Value 0.9964 0.9880 0.9812 0.9876 1.0000
## Neg Pred Value 0.9986 0.9950 0.9978 0.9988 0.9991
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2835 0.1894 0.1726 0.1629 0.1830
## Detection Prevalence 0.2845 0.1917 0.1759 0.1649 0.1830
## Balanced Accuracy 0.9975 0.9880 0.9929 0.9957 0.9979
All models performed quite satisfactorily, but the winner was mod11, the random forest model, with 0.99 accuracy.
Here are the results with the small test data of 20 observations:
predict(mod11,pml_testing)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
All test activities are predicted correctly, so the model works well.
The classes in the testing data are predicted with 99% accuracy, which is almost a perfect score. The other performance metrics are also very high. This means the out-of-sample error rate is very small, but a question arises here: is there an over-fitting issue? There should not be. The number of observations is quite sufficient, and I do not expect to see very different variation in real life.
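The expected out-of-sample error estimate can be made explicit from the hold-out results above: it is simply 1 minus the hold-out accuracy, with a binomial confidence interval quantifying the uncertainty. A quick sketch using the random forest's confusion-matrix counts from above:

```r
# diagonal (correct predictions) of the random forest's hold-out
# confusion matrix shown above
correct <- 1112 + 743 + 677 + 639 + 718
total   <- 3923                  # observations in the 20% hold-out set
oos_error <- 1 - correct / total
oos_error                        # about 0.0087, i.e. 1 - 0.9913
# exact binomial 95% CI for the error rate (confusionMatrix reports
# the corresponding CI for accuracy)
1 - rev(binom.test(correct, total)$conf.int)
```

The interval matches the accuracy CI of (0.9879, 0.994) reported above.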
Detecting multicollinearity and selecting significant variables are important preparation tasks when building prediction models. There are different techniques for this; here are my study notes on the topic.
I will use the Boston data set in R to demonstrate these techniques.
library(MASS)  # Boston data set
library(car)   # vif()
data("Boston")
boston <- as.data.frame(Boston)
str(boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
The Boston data set contains housing information for the Boston area from a 1978 study. You can simply type ?Boston to see the details. I will try to estimate the housing prices using the predictors. The medv variable is the median home value (in $1000s), and the other variables are the predictors.
The first thing we generally need to do is create the training and testing sets.
set.seed(1234)
inTrain <- createDataPartition(y=boston$medv,p=0.7, list=FALSE)
training <- boston[inTrain,]
testing <- boston[-inTrain,]
dim(testing);dim(training)
## [1] 150 14
## [1] 356 14
If your data does not have too many dimensions, the following command gives a quick overview, and you can see at a glance what is going on in general.
plot(boston)
The pairs plot shows that there are some trends in the data, which suggests prediction may work well, but I will not go into interpreting the plots now. Instead, we can build the correlation matrix of the data, which makes this easier.
# Make a matrix of correlations of all predictors
M <- abs(cor(training[,-14]))
# Set the diagonal to zero (the correlation of a predictor with itself, it's 1, we know, so we should remove it)
diag(M) <- 0
# Find the parameters having correlation over a threshold.
which(M > 0.7,arr.ind=T)
## row col
## nox 5 3
## dis 8 3
## tax 10 3
## indus 3 5
## age 7 5
## dis 8 5
## nox 5 7
## dis 8 7
## indus 3 8
## nox 5 8
## age 7 8
## tax 10 9
## indus 3 10
## rad 9 10
Some variables seem to be collinear. We could manually remove them when building the model, but since I would like to demonstrate some techniques for feature selection, I will not do that.
Eigensystem analysis is a technique that goes beyond pairwise checks. When the eigenvalues of the correlation matrix are all of similar magnitude, multicollinearity is not a concern; when they differ widely, as in this data, there is likely a problem. The condition number, the ratio of the largest to the smallest eigenvalue, summarizes this spread, and large values indicate multicollinearity.
eigen(cor(training))$values
## [1] 6.53779955 1.67117293 1.38710353 0.95993427 0.85284117 0.64903850
## [7] 0.47742824 0.40464360 0.25564310 0.22947201 0.20946720 0.17785106
## [13] 0.13408959 0.05351526
# Condition Number: Ratio of max to min Eigen values of the correlation matrix
max(eigen(cor(training))$values)/min(eigen(cor(training))$values)
## [1] 122.167
kappa(cor(training), exact = TRUE)
## [1] 122.167
Intuitively, we might expect the crime rate to be an important factor.
plot_ly(training, x = ~crim, y = ~medv, color = ~crim,
size = ~crim, text = ~paste("Zn: ", zn),type = "scatter", mode="markers" )
The graph shows that the crime rate trends with housing prices, as expected.
The same visualization can be made for the other features as well.
plot_ly(training, x = ~indus, y = ~medv, color = ~indus,
size = ~indus, text = ~paste("Zn: ", zn),type = "scatter", mode="markers" )
plot_ly(training, x = ~chas, y = ~medv,type = "box")
How about checking the variables that were found to be collinear, e.g. nox with crim?
plot_ly(training, x = ~nox, y = ~medv, color = ~nox,
size = ~nox, text = ~paste("Zn: ", zn),type = "scatter", mode="markers" )
A similar trend against medv exists for both nox and crim. Let's move ahead.
The variance inflation factor (VIF) is another technique that can be used for feature selection.
model = lm(medv~., data = training)
vif(model)
## crim zn indus chas nox rm age
## 2.128848 2.387983 4.084376 1.076718 4.359516 1.870169 3.305758
## dis rad tax ptratio black lstat
## 4.120336 8.443791 10.152816 1.812406 1.405746 2.929285
Using the VIF values together with each feature's significance in the model, features are eliminated to reduce variance inflation and to keep only the significant features.
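The elimination step can also be automated. The sketch below uses a hypothetical helper (not part of the original analysis) that computes VIFs from the diagonal of the inverse of the predictor correlation matrix, which is equivalent to car::vif() for a linear model with numeric predictors, and repeatedly drops the worst offender until all VIFs fall below a threshold:

```r
library(MASS)  # Boston data

# Hypothetical helper: VIF_j is the j-th diagonal element of the
# inverse of the predictor correlation matrix
prune_by_vif <- function(data, outcome, threshold = 5) {
  preds <- setdiff(names(data), outcome)
  repeat {
    vifs <- diag(solve(cor(data[, preds])))
    if (max(vifs) < threshold || length(preds) <= 2) break
    preds <- preds[-which.max(vifs)]  # drop the predictor with the highest VIF
  }
  preds
}

kept <- prune_by_vif(Boston, "medv", threshold = 5)
kept
setdiff(names(Boston), c(kept, "medv"))  # variables dropped by the rule
```

With a threshold of 5 on the full Boston data, tax (the predictor with the highest VIF) is the first variable removed. Note that this rule looks only at collinearity, not at significance, so it complements rather than replaces the manual elimination above.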
The following model seems to fit well according to this method.
model = lm(medv~.-dis-indus-chas-age-zn-nox-crim-tax-rad, data = training)
summary(model)
##
## Call:
## lm(formula = medv ~ . - dis - indus - chas - age - zn - nox -
## crim - tax - rad, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.142 -2.856 -0.662 1.682 30.585
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.319134 4.664624 2.855 0.00455 **
## rm 4.694079 0.466507 10.062 < 2e-16 ***
## ptratio -0.960232 0.128464 -7.475 6.23e-13 ***
## black 0.010218 0.003229 3.165 0.00169 **
## lstat -0.513637 0.051092 -10.053 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.876 on 351 degrees of freedom
## Multiple R-squared: 0.7104, Adjusted R-squared: 0.7071
## F-statistic: 215.2 on 4 and 351 DF, p-value: < 2.2e-16
vif(model)
## rm ptratio black lstat
## 1.626664 1.196178 1.210080 1.900563
pred <- predict(model, testing)
The following plot shows the testing values against the predictions. The trend looks linear, so we can say the model did OK. But there are some outliers, which means there is some variance that the model is not able to explain.
plot_ly(testing, x = pred, y = ~medv,type = "scatter",mode="markers")
plot(model)
So we should probably include some other variables, but which ones?
The earth package implements variable importance based on generalized cross validation (GCV), the number of subset models in which each variable occurs (nsubsets) and the residual sum of squares (RSS). I tried this method on the data as follows:
marsModel <- earth(medv~., data=training)
ev <- evimp (marsModel,trim = FALSE)
ev
## nsubsets gcv rss
## rm 19 100.0 100.0
## lstat 18 62.2 63.4
## ptratio 16 31.5 34.8
## nox 15 27.6 31.1
## dis 14 26.9 30.1
## crim 11 19.8 23.2
## tax 9 14.4 18.3
## rad 7 10.8 14.8
## black 5 4.8 10.2
## indus 3 4.0 8.0
## zn-unused 0 0.0 0.0
## chas-unused 0 0.0 0.0
## age-unused 0 0.0 0.0
Our model has rm, ptratio, black and lstat, so we correctly selected the top 3 features (rm, lstat, ptratio). black is also among the features the earth package finds significant, but it ranks below several features we had excluded. So let's update the model accordingly.
model = lm(medv~.-indus-chas-age-zn-black-rad, data = training)
summary(model)
##
## Call:
## lm(formula = medv ~ . - indus - chas - age - zn - black - rad,
## data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6172 -2.8600 -0.5911 2.0034 28.4226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.482e+01 5.370e+00 6.485 3.04e-10 ***
## crim -4.968e-02 4.773e-02 -1.041 0.298595
## nox -1.484e+01 4.057e+00 -3.658 0.000293 ***
## rm 4.180e+00 4.557e-01 9.173 < 2e-16 ***
## dis -1.068e+00 1.855e-01 -5.760 1.85e-08 ***
## tax -9.173e-04 2.535e-03 -0.362 0.717681
## ptratio -1.016e+00 1.400e-01 -7.259 2.57e-12 ***
## lstat -5.688e-01 5.652e-02 -10.062 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.727 on 348 degrees of freedom
## Multiple R-squared: 0.7301, Adjusted R-squared: 0.7247
## F-statistic: 134.5 on 7 and 348 DF, p-value: < 2.2e-16
vif(model)
## crim nox rm dis tax ptratio lstat
## 1.819539 3.666052 1.651545 2.537302 2.815370 1.510923 2.474887
pred <- predict(model, testing)
After adding the features, the model performs slightly better: R-squared increased to 73% from 71%, and the residual standard error decreased to 4.7 from 4.8. Whether a slightly better model is worth a few extra terms is a separate decision; sometimes simplicity is more important than accuracy.
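The accuracy side of that trade-off can also be checked on held-out data directly, e.g. by comparing the test RMSE of the two candidate models. A quick sketch (using a simple random split here rather than the caret partition above, so the exact numbers will differ from this report's):

```r
library(MASS)  # Boston data

set.seed(1234)
idx <- sample(nrow(Boston), round(0.7 * nrow(Boston)))
tr <- Boston[idx, ]
te <- Boston[-idx, ]

# the 4-term model from the VIF step vs. the larger model guided by earth
m_small <- lm(medv ~ rm + ptratio + black + lstat, data = tr)
m_large <- lm(medv ~ crim + nox + rm + dis + tax + ptratio + lstat, data = tr)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse_small <- rmse(te$medv, predict(m_small, te))
rmse_large <- rmse(te$medv, predict(m_large, te))
c(small = rmse_small, large = rmse_large)
```

If the larger model's test RMSE is lower, the in-sample improvement carries over to held-out data; if not, the extra terms are not paying for themselves.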
plot_ly(testing, x = pred, y = ~medv,type = "scatter",mode="markers")
plot(model)
The following plot shows the real values vs. the predictions. The real values are light gray, and as you can see some are quite far from the trend, probably because of unexplained variation in the data. Perhaps there are other factors (not in the data) affecting house prices that we have no information about.
plot(testing$lstat,testing$medv, col="lightgrey")
points(testing$lstat,pred,col="red")